Update splitter to fedora modules upstream and improve documentation.

The grobisplitter parts need some documentation to explain what they
are doing and for whom. This is a first attempt at getting that right.

Signed-off-by: Stephen Smoogen <ssmoogen@redhat.com>
Stephen Smoogen 2020-12-03 16:22:44 -05:00
parent 8b4c38e29e
commit ddb13e640a
4 changed files with 451 additions and 135 deletions


@@ -0,0 +1,183 @@
# Grobisplitter
### Or how I learned to stop worrying and love modules
## Where are the sources
The current master git repository for the grobisplitter program is
https://github.com/fedora-modularity/GrobiSplitter . The program
depends upon python3 and the following packages (an install sketch
follows the list):
* gobject-introspection
* libmodulemd-2.5.0
* libmodulemd1-1.8.11
* librepo
* python3-gobject-base
* python3-hawkey
* python3-librepo
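On a Fedora or RHEL-8 system these could be installed with something
like the following sketch; the package names follow the list above,
and exact names and versions may differ by release:
``` shell
# Install the runtime dependencies listed above.
sudo dnf install gobject-introspection libmodulemd libmodulemd1 librepo \
    python3-gobject-base python3-hawkey python3-librepo
```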
## What does Grobisplitter's splitter.py do?
Grobisplitter was born out of the addition of modules to Fedora and
RHEL-8. A module is a virtual rpm repository inside of a standard rpm
repository; a sysadmin can choose which of these virtual repositories
are used on a system. This allows for useful choices without having
to add more repository configs, but it adds a complexity that the koji
build system does not understand. While the MBS system can help
handle this for packages it knows it built, it cannot do so for
external ones, which is the case when building CentOS or EPEL
packages.
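For readers new to modules, here is a minimal illustration of the
sysadmin-facing choice; the module name is just an example:
``` shell
# Enable one stream of the httpd module; dnf will then resolve
# httpd packages only from that virtual repository.
dnf module enable httpd:2.4
dnf module list --enabled
```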
Grobisplitter was created by Patrick Uiterwijk to deal with part of
this while permanent solutions were created in MBS and
koji. Grobisplitter takes a modular repository (for example, a
reposync copy of RHEL-8) and 'flattens' it out, with each module
becoming its own independent repository. The options to the command
are:
``` shell
[smooge@batcave01 RHEL-8-001]$ /usr/local/bin/splitter.py --help
usage: splitter.py [-h] [--action {hardlink,symlink,copy}] [--target TARGET]
[--skip-missing] [--create-repos] [--only-defaults]
repository
Split repositories up
positional arguments:
repository The repository to split
optional arguments:
-h, --help show this help message and exit
--action {hardlink,symlink,copy}
Method to create split repos files
--target TARGET Target directory for split repos
--skip-missing Skip missing packages
--create-repos Create repository metadatas
--only-defaults Only output default modules
```
To save disk space, one can use different methods to copy packages
(hardlink, symlink, or copy), target a specific directory, output only
the default modules, and create repository metadata for each of the
virtual repositories separately.
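As a concrete illustration, the flags combine like this; both paths
are hypothetical and not taken from the original document:
``` shell
# Hardlink packages out of a reposync'd tree into ./split, keep only
# default module streams, and create repodata for each split repo.
/usr/local/bin/splitter.py --action hardlink --only-defaults \
    --create-repos --target ./split \
    /mnt/fedora/app/fi-repo/rhel/rhel8/x86_64/appstream
```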
Each module is split out into a directory named after its modular
metadata (name:stream:version:context:arch). For example, as of
2020-12-03, here are the httpd modules of RHEL-8 split out:
``` shell
[smooge@batcave01 RHEL-8-001]$ ls -1d httpd*
httpd:2.4:8000020190405071959:55190bc5:x86_64/
httpd:2.4:8000020190829150747:f8e95b4e:x86_64/
httpd:2.4:8010020190829143335:cdc1202b:x86_64/
httpd:2.4:8020020200122152618:6a468ee4:x86_64/
httpd:2.4:8020020200824162909:4cda2c84:x86_64/
httpd:2.4:8030020200818000036:30b713e6:x86_64/
```
The reason multiple versions of each module are kept, rather than
just the latest, is that it is hard to know which 'latest' module
should be used. The build system needs to know about all the packages
in the upstream repositories for modular decisions to be made. This
means the staged data will be a complete copy of the RHN repository.
``` shell
total 4980
-rw-r--r--. 1 root sysadmin-main 1463679 2020-11-03 09:18 httpd-2.4.37-30.module+el8.3.0+7001+0766b9e7.x86_64.rpm
-rw-r--r--. 1 root sysadmin-main 224591 2020-11-03 09:18 httpd-devel-2.4.37-30.module+el8.3.0+7001+0766b9e7.x86_64.rpm
-rw-r--r--. 1 root sysadmin-main 37599 2020-11-03 09:18 httpd-filesystem-2.4.37-30.module+el8.3.0+7001+0766b9e7.noarch.rpm
-rw-r--r--. 1 root sysadmin-main 2486719 2020-11-03 09:18 httpd-manual-2.4.37-30.module+el8.3.0+7001+0766b9e7.noarch.rpm
-rw-r--r--. 1 root sysadmin-main 106479 2020-11-03 09:18 httpd-tools-2.4.37-30.module+el8.3.0+7001+0766b9e7.x86_64.rpm
-rw-r--r--. 1 root sysadmin-main 157763 2020-11-03 09:18 mod_http2-1.15.7-2.module+el8.3.0+7670+8bf57d29.x86_64.rpm
-rw-r--r--. 1 root sysadmin-main 84163 2020-11-03 09:18 mod_ldap-2.4.37-30.module+el8.3.0+7001+0766b9e7.x86_64.rpm
-rw-r--r--. 1 root sysadmin-main 189343 2020-11-03 09:18 mod_md-2.0.8-8.module+el8.3.0+6814+67d1e611.x86_64.rpm
-rw-r--r--. 1 root sysadmin-main 60531 2020-11-03 09:18 mod_proxy_html-2.4.37-30.module+el8.3.0+7001+0766b9e7.x86_64.rpm
-rw-r--r--. 1 root sysadmin-main 72475 2020-11-03 09:18 mod_session-2.4.37-30.module+el8.3.0+7001+0766b9e7.x86_64.rpm
-rw-r--r--. 1 root sysadmin-main 135799 2020-11-03 09:18 mod_ssl-2.4.37-30.module+el8.3.0+7001+0766b9e7.x86_64.rpm
```
All non-modular rpms from the repository are put in a directory called
`non_modular`, which can also have its own repodata set up for it.
## What does rhel8-split.sh do?
While the splitter command does the hard work of splitting out the
packages, the rhel8-split.sh shell script does the 'business' work of
setting up the repositories so that koji can consume them for EPEL-8
and other builds.
The first part of this is done by a cron job which reposyncs the
various packages from Red Hat's access.redhat.com for the
architectures Fedora Infrastructure needs. The data is synced down
into subdirectories of `/mnt/fedora/app/fi-repo/rhel/rhel8` which
match the RHEL BaseOS, AppStream, and CodeReady Builder channels as
needed.
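A minimal sketch of one such sync, assuming dnf's reposync plugin;
the repo id here is an assumption for illustration:
``` shell
# Pull all package versions plus repository metadata for one channel.
dnf reposync --download-metadata \
    --repoid=rhel-8-for-x86_64-baseos-rpms \
    --download-path=/mnt/fedora/app/fi-repo/rhel/rhel8/x86_64
```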
Next, a new destination directory is made in
`/mnt/fedora/app/fi-repo/rhel/rhel8/koji/`, named with the date the
cron job is run, so that we can always roll back to an older external
Red Hat repo if needed. Afterwards we begin breaking apart the repos
per architecture. The splitter is then called for each channel that
is wanted in EPEL. The BaseOS and AppStream channels split out only
the 'default' modules, while CodeReady Builder splits out all modules,
as many of them are non-default.
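Under those rules, the per-channel calls would look roughly like this;
the variable names are modeled on the script below, and the directory
names are assumptions, not quoted from it:
``` shell
# BaseOS and AppStream: default module streams only.
splitter.py --action hardlink --create-repos --only-defaults \
    --target ${DATEDIR}/${ARCH}/baseos ${ARCHDIR}/baseos
# CodeReady Builder: all module streams, default or not.
splitter.py --action hardlink --create-repos \
    --target ${DATEDIR}/${ARCH}/codeready ${ARCHDIR}/codeready
```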
After the files have been copied into a single tree, `createrepo_c`
is run over the data. This creates a 'flattened' repository; however,
the modular metadata from all of these repos is currently lost.
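That flattening step is essentially the following; the target path is
illustrative:
``` shell
# Rebuild plain repodata over the combined tree; modular metadata is
# not carried over at this stage.
createrepo_c --no-database ${DATEDIR}/${ARCH}/RHEL-8-001
```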
Once the data has been synced and flattened for all repositories, a
series of links is set up for koji to point at. At this point a last
reposync cycle is done using dnf to pull in only the newest
rpms. This effectively cleans up a large number of older packages,
making it easier for the builders to decide which package to
use. [As of 2020-12-03, the staged repo has 66130 packages
in it, and the latest shrinks that down to 26530.]
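A guess at what that final newest-only pass looks like; the repo id
and path are assumptions:
``` shell
# Keep only the newest version of each package in the 'latest' tree.
dnf reposync --newest-only \
    --repoid=RHEL-8-001 \
    --download-path=/mnt/fedora/app/fi-repo/rhel/rhel8/koji/latest/x86_64
```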
Koji then is pointed to the trees on batcave served from
`/mnt/fedora/app/fi-repo/rhel/rhel8/koji/latest/${arch}/RHEL-8-001`.
TODO:
1. Currently the RHEL-8-001 name is a consequence of the rhel8-split.sh
   script. We split each repo into its own tree and then copy them
   into one final tree. This should be done better.
2. A way to clean up the 'empty' directory names in latest would help
make it easier to see what is actually being 'used' by koji.
``` shell
[smooge@batcave01 latest]$ ls -1d x86_64/RHEL-8-001/go-toolset\:rhel8\:80*
x86_64/RHEL-8-001/go-toolset:rhel8:8000020190509153318:b9255456:x86_64/
x86_64/RHEL-8-001/go-toolset:rhel8:8000120190520160856:4a778a88:x86_64/
x86_64/RHEL-8-001/go-toolset:rhel8:8000120190828225436:14bc675c:x86_64/
x86_64/RHEL-8-001/go-toolset:rhel8:8010020190829001136:ccff3eb7:x86_64/
x86_64/RHEL-8-001/go-toolset:rhel8:8010020191220185136:0ed30617:x86_64/
x86_64/RHEL-8-001/go-toolset:rhel8:8020020200128163444:0ab52eed:x86_64/
x86_64/RHEL-8-001/go-toolset:rhel8:8020020200817154239:02f7cb7a:x86_64/
x86_64/RHEL-8-001/go-toolset:rhel8:8030020200827141259:13702366:x86_64/
```
This makes it look like there are many populated trees; however, only
one tree actually has files in it:
``` shell
[smooge@batcave01 latest]$ find x86_64/RHEL-8-001/go-toolset\:rhel8\:80*
x86_64/RHEL-8-001/go-toolset:rhel8:8000020190509153318:b9255456:x86_64
x86_64/RHEL-8-001/go-toolset:rhel8:8000120190520160856:4a778a88:x86_64
x86_64/RHEL-8-001/go-toolset:rhel8:8000120190828225436:14bc675c:x86_64
x86_64/RHEL-8-001/go-toolset:rhel8:8010020190829001136:ccff3eb7:x86_64
x86_64/RHEL-8-001/go-toolset:rhel8:8010020191220185136:0ed30617:x86_64
x86_64/RHEL-8-001/go-toolset:rhel8:8020020200128163444:0ab52eed:x86_64
x86_64/RHEL-8-001/go-toolset:rhel8:8020020200817154239:02f7cb7a:x86_64
x86_64/RHEL-8-001/go-toolset:rhel8:8030020200827141259:13702366:x86_64
x86_64/RHEL-8-001/go-toolset:rhel8:8030020200827141259:13702366:x86_64/delve-1.4.1-1.module+el8.3.0+7840+63dfb1ed.x86_64.rpm
x86_64/RHEL-8-001/go-toolset:rhel8:8030020200827141259:13702366:x86_64/go-toolset-1.14.7-1.module+el8.3.0+7840+63dfb1ed.x86_64.rpm
x86_64/RHEL-8-001/go-toolset:rhel8:8030020200827141259:13702366:x86_64/golang-1.14.7-2.module+el8.3.0+7840+63dfb1ed.x86_64.rpm
x86_64/RHEL-8-001/go-toolset:rhel8:8030020200827141259:13702366:x86_64/golang-bin-1.14.7-2.module+el8.3.0+7840+63dfb1ed.x86_64.rpm
x86_64/RHEL-8-001/go-toolset:rhel8:8030020200827141259:13702366:x86_64/golang-docs-1.14.7-2.module+el8.3.0+7840+63dfb1ed.noarch.rpm
x86_64/RHEL-8-001/go-toolset:rhel8:8030020200827141259:13702366:x86_64/golang-misc-1.14.7-2.module+el8.3.0+7840+63dfb1ed.noarch.rpm
x86_64/RHEL-8-001/go-toolset:rhel8:8030020200827141259:13702366:x86_64/golang-race-1.14.7-2.module+el8.3.0+7840+63dfb1ed.x86_64.rpm
x86_64/RHEL-8-001/go-toolset:rhel8:8030020200827141259:13702366:x86_64/golang-src-1.14.7-2.module+el8.3.0+7840+63dfb1ed.noarch.rpm
x86_64/RHEL-8-001/go-toolset:rhel8:8030020200827141259:13702366:x86_64/golang-tests-1.14.7-2.module+el8.3.0+7840+63dfb1ed.noarch.rpm
```


@@ -1,12 +0,0 @@
The Current Master Git Repository for the grobisplitter program is
https://github.com/smooge/GrobiSplitter.git to be moved under a
Community Infrastructure repository later. The program depends upon
python3 and other programs.
gobject-introspection
libmodulemd-2.5.0
libmodulemd1-1.8.11
librepo
python3-gobject-base
python3-hawkey
python3-librepo


@@ -1,4 +1,6 @@
#!/bin/bash
## Setup basic environment variables.
HOMEDIR=/mnt/fedora/app/fi-repo/rhel/rhel8
BINDIR=/usr/local/bin
@@ -7,6 +9,10 @@ DATE=$(date -Ih | sed 's/+.*//')
DATEDIR=${HOMEDIR}/koji/${DATE}
##
## Make a directory for where the new tree will live. Use a new date
## so that we can roll back to an older release or stop updates for
## some time if needed.
if [ -d ${DATEDIR} ]; then
echo "Directory already exists. Please remove or fix"
exit
@@ -14,6 +20,9 @@ else
mkdir -p ${DATEDIR}
fi
##
## Go through each architecture and split its repositories apart.
##
for ARCH in ${ARCHES}; do
# The archdir is where we daily download updates for rhel8
ARCHDIR=${HOMEDIR}/${ARCH}


@@ -12,32 +12,33 @@ import tempfile
import os
import subprocess
import sys
import logging
# Look for a specific version of modulemd. The 1.x series does not
# have the tools we need.
try:
    gi.require_version('Modulemd', '2.0')
    from gi.repository import Modulemd as mmd
except ValueError:
    print("libmodulemd 2.0 is not installed..")
    sys.exit(1)
# We only want to load the module metadata once. It can be reused as
# often as required.
_idx = None


# This code is from Stephen Gallagher to make my other caveman code
# less icky.
def _get_latest_streams(mymod, stream):
"""
Routine takes modulemd object and a stream name.
    Finds the latest stream from that and returns it as a stream
    object.
"""
all_streams = mymod.search_streams(stream, 0)
    latest_streams = mymod.search_streams(stream,
                                          all_streams[0].props.version)
return latest_streams
def _get_repoinfo(directory):
"""
A function which goes into the given directory and sets up the
@@ -54,6 +55,46 @@ def _get_repoinfo(directory):
r = h.perform()
return r.getinfo(librepo.LRR_YUM_REPO)
def _get_modulemd(directory=None, repo_info=None):
"""
Retrieve the module metadata from this repository.
:param directory: The path to the repository. Must contain repodata/repomd.xml and modules.yaml.
:param repo_info: An already-acquired repo_info structure
    :return: A Modulemd.ModuleIndex object containing the module metadata from this repository.
"""
# Return the cached value
global _idx
if _idx:
return _idx
# If we don't have a cached value, we need either directory or repo_info
assert directory or repo_info
if directory:
directory = os.path.abspath(directory)
repo_info = _get_repoinfo(directory)
if 'modules' not in repo_info:
return None
_idx = mmd.ModuleIndex.new()
with gzip.GzipFile(filename=repo_info['modules'], mode='r') as gzf:
mmdcts = gzf.read().decode('utf-8')
res, failures = _idx.update_from_string(mmdcts, True)
if len(failures) != 0:
raise Exception("YAML FAILURE: FAILURES: %s" % failures)
if not res:
raise Exception("YAML FAILURE: res != True")
# Ensure that every stream in the index is using v2
_idx.upgrade_streams(mmd.ModuleStreamVersionEnum.TWO)
return _idx
def _get_hawkey_sack(repo_info):
"""
A function to pull in the repository sack from hawkey.
@@ -66,9 +107,10 @@ def _get_hawkey_sack(repo_info):
primary_sack = hawkey.Sack()
primary_sack.load_repo(hk_repo, build_cache=False)
return primary_sack
def _get_filelist(package_sack):
"""
Determine the file locations of all packages in the sack. Use the
@@ -77,10 +119,12 @@ def _get_filelist(package_sack):
"""
pkg_list = {}
for pkg in hawkey.Query(package_sack):
        nevr = "%s-%s:%s-%s.%s" % (pkg.name, pkg.epoch,
                                   pkg.version, pkg.release, pkg.arch)
pkg_list[nevr] = pkg.location
return pkg_list
def _parse_repository_non_modular(package_sack, repo_info, modpkgset):
"""
Simple routine to go through a repo, and figure out which packages
@@ -97,20 +141,14 @@ def _parse_repository_non_modular(package_sack, repo_info, modpkgset):
pkgs.add(pkg.location)
return pkgs
def _parse_repository_modular(repo_info, package_sack):
"""
Returns a dictionary of packages indexed by the modules they are
contained in.
"""
cts = {}
    idx = _get_modulemd(repo_info=repo_info)
    pkgs_list = _get_filelist(package_sack)
@@ -124,14 +162,14 @@ def _parse_repository_modular(repo_info,package_sack):
else:
continue
cts[stream.get_NSVCA()] = templ
return cts
def _get_modular_pkgset(mod):
"""
Takes a module and goes through the moduleset to determine which
    packages are inside it.
Returns a list of packages
"""
pkgs = set()
@@ -142,6 +180,7 @@ def _get_modular_pkgset(mod):
return list(pkgs)
def _perform_action(src, dst, action):
"""
Performs either a copy, hardlink or symlink of the file src to the
@@ -160,6 +199,7 @@ def _perform_action(src, dst, action):
elif action == 'symlink':
os.symlink(src, dst)
def validate_filenames(directory, repoinfo):
"""
Take a directory and repository information. Test each file in
@@ -176,107 +216,175 @@ def validate_filenames(directory, repoinfo):
return isok
def _get_recursive_dependencies(all_deps, idx, stream, ignore_missing_deps):
if stream.get_NSVCA() in all_deps:
# We've already encountered this NSVCA, so don't go through it again
logging.debug('Already included {}'.format(stream.get_NSVCA()))
return
# Store this NSVCA/NS pair
local_deps = all_deps
local_deps.add(stream.get_NSVCA())
logging.debug("Recursive deps: {}".format(stream.get_NSVCA()))
# Loop through the dependencies for this stream
deps = stream.get_dependencies()
# At least one of the dependency array entries must exist in the repo
found_dep = False
for dep in deps:
# Within an array entry, all of the modules must be present in the
# index
found_all_modules = True
for modname in dep.get_runtime_modules():
# Ignore "platform" because it's special
if modname == "platform":
logging.debug('Skipping platform')
continue
logging.debug('Processing dependency on module {}'.format(modname))
mod = idx.get_module(modname)
if not mod:
# This module wasn't present in the index.
                found_all_modules = False
continue
# Within a module, at least one of the requested streams must be
# present
streamnames = dep.get_runtime_streams(modname)
found_stream = False
for streamname in streamnames:
stream_list = _get_latest_streams(mod, streamname)
for inner_stream in stream_list:
try:
_get_recursive_dependencies(
local_deps, idx, inner_stream, ignore_missing_deps)
except FileNotFoundError as e:
# Could not find all of this stream's dependencies in
# the repo
continue
found_stream = True
# None of the streams were found for this module
if not found_stream:
found_all_modules = False
# We've iterated through all of the modules; if it's still True, this
# dependency is consistent in the index
if found_all_modules:
found_dep = True
# We were unable to resolve the dependencies for any of the array entries.
# raise FileNotFoundError
if not found_dep and not ignore_missing_deps:
raise FileNotFoundError(
"Could not resolve dependencies for {}".format(
stream.get_NSVCA()))
all_deps.update(local_deps)
def get_default_modules(directory, ignore_missing_deps):
"""
Work through the list of modules and come up with a default set of
    modules which would be the minimum to output.
    Returns a set of modules
    """
    all_deps = set()
    idx = _get_modulemd(directory)
    if not idx:
        return all_deps
    for modname, streamname in idx.get_default_streams().items():
        # Only the latest version of a stream is important, as that is the
        # only one that DNF will consider in its transaction logic. We still
        # need to handle each context individually.
        mod = idx.get_module(modname)
        stream_set = _get_latest_streams(mod, streamname)
for stream in stream_set:
# Different contexts have different dependencies
try:
logging.debug("Processing {}".format(stream.get_NSVCA()))
_get_recursive_dependencies(all_deps, idx, stream, ignore_missing_deps)
logging.debug("----------")
except FileNotFoundError as e:
# Not all dependencies could be satisfied
print(
"Not all dependencies for {} could be satisfied. {}. Skipping".format(
stream.get_NSVCA(), e))
continue
logging.debug('Default module streams: {}'.format(all_deps))
return all_deps
def _pad_svca(svca, target_length):
    """
    If the split() doesn't return all values (e.g. arch is missing), pad it
    with `None`
    """
    length = len(svca)
    svca.extend([None] * (target_length - length))
    return svca
def _dump_modulemd(modname, yaml_file):
idx = _get_modulemd()
assert idx
# Create a new index to hold the information about this particular
# module and stream
new_idx = mmd.ModuleIndex.new()
# Add the module streams
module_name, *svca = modname.split(':')
stream_name, version, context, arch = _pad_svca(svca, 4)
logging.debug("Dumping YAML for {}, {}, {}, {}, {}".format(
module_name, stream_name, version, context, arch))
mod = idx.get_module(module_name)
streams = mod.search_streams(stream_name, int(version), context, arch)
# This should usually be a single item, but we'll be future-compatible
# and account for the possibility of having multiple streams here.
for stream in streams:
new_idx.add_module_stream(stream)
# Add the module defaults
defs = mod.get_defaults()
if defs:
new_idx.add_defaults(defs)
# libmodulemd doesn't currently expose the get_translation()
# function, but that will be added in 2.8.0
try:
# Add the translation object
translation = mod.get_translation()
if translation:
new_idx.add_translation(translation)
except AttributeError as e:
# This version of libmodulemd does not yet support this function.
# Just ignore it.
pass
# Write out the file
try:
with open(yaml_file, 'w') as output:
output.write(new_idx.dump_to_string())
except PermissionError as e:
logging.error("Could not write YAML to file: {}".format(e))
raise
def perform_split(repos, args, def_modules):
for modname in repos:
if args.only_defaults and modname not in def_modules:
continue
targetdir = os.path.join(args.target, modname)
os.mkdir(targetdir)
@@ -287,8 +395,12 @@ def perform_split(repos, args, def_modules):
os.path.join(targetdir, pkgfile),
args.action)
# Extract the modular metadata for this module
if modname != 'non_modular':
_dump_modulemd(modname, os.path.join(targetdir, 'modules.yaml'))
def create_repos(target, repos, def_modules, only_defaults):
"""
Routine to create repositories. Input is target directory and a
list of repositories.
@@ -297,9 +409,19 @@ def create_repos(target, repos,def_modules, only_defaults):
for modname in repos:
if only_defaults and modname not in def_modules:
continue
targetdir = os.path.join(target, modname)
subprocess.run([
            'createrepo_c', targetdir,
'--no-database'])
if modname != 'non_modular':
subprocess.run([
'modifyrepo_c',
'--mdtype=modules',
os.path.join(targetdir, 'modules.yaml'),
os.path.join(targetdir, 'repodata')
])
def parse_args():
@@ -309,6 +431,8 @@ def parse_args():
"""
parser = argparse.ArgumentParser(description='Split repositories up')
parser.add_argument('repository', help='The repository to split')
parser.add_argument('--debug', help='Enable debug logging',
action='store_true', default=False)
parser.add_argument('--action', help='Method to create split repos files',
choices=('hardlink', 'symlink', 'copy'),
default='hardlink')
@@ -319,6 +443,11 @@ def parse_args():
action='store_true', default=False)
parser.add_argument('--only-defaults', help='Only output default modules',
action='store_true', default=False)
parser.add_argument('--ignore-missing-default-deps',
help='When using --only-defaults, do not skip '
'default streams whose dependencies cannot be '
'resolved within this repository',
action='store_true', default=False)
return parser.parse_args()
@@ -337,6 +466,7 @@ def setup_target(args):
else:
os.mkdir(args.target)
def parse_repository(directory):
"""
Parse a specific directory, returning a dict with keys module NSVC's and
@@ -353,45 +483,51 @@ def parse_repository(directory):
# If we have a repository with no modules we do not want our
# script to error out but just remake the repository with
# everything in a known sack (aka non_modular).
if 'modules' in repo_info:
        mod = _parse_repository_modular(repo_info, package_sack)
modpkgset = _get_modular_pkgset(mod)
else:
mod = dict()
modpkgset = set()
    non_modular = _parse_repository_non_modular(package_sack, repo_info,
                                                modpkgset)
mod['non_modular'] = non_modular
    # We should probably go through our default modules here and
    # remove them from our mod. This would cut down some code paths.
return mod
def main():
    # Determine what the arguments are and act on them.
args = parse_args()
if args.debug:
logging.basicConfig(level=logging.DEBUG)
# Go through arguments and act on their values.
setup_target(args)
repos = parse_repository(args.repository)
if args.only_defaults:
        def_modules = get_default_modules(args.repository, args.ignore_missing_default_deps)
else:
def_modules = set()
        def_modules.add('non_modular')
if not args.skip_missing:
if not validate_filenames(args.repository, repos):
raise ValueError("Package files were missing!")
if args.target:
perform_split(repos, args, def_modules)
if args.create_repos:
        create_repos(args.target, repos, def_modules, args.only_defaults)
if __name__ == '__main__':
main()