Updating Leeroy Root Passwords

The root passwords for Leeroy needed to be updated, without having access to the current passwords. Fortunately, this is a well-known problem with lots of software support. Both the host and guest OS root passwords needed to be updated.

Updating the Host

Updating a host root password is easy, as it’s something every good bad admin needs to do sometime in their life. There are lots of posts on it, including https://www.tecmint.com/reset-forgotten-root-password-in-debian/

Changing the password of the non-root account was also easy with the command passwd <theusername>

Updating the Guest

Updating the guest password is much harder.

I first considered using guestfish, which had instructions at links like the following:
https://unixhealthcheck.com/blog?id=424 https://tipsmake.com/use-guestfish-to-reset-the-forgotten-root-password-for-the-virtual-machine-in-qcow2-format-on-kvm

I guess that struck me as too easy, so I kept looking for other ideas. The tool virt-sysprep seemed to be designed to do exactly what I needed it to do (as a subset of its overall functionality), and have a greater likelihood of success (hah). The link that seemed best was https://www.cyberciti.biz/faq/reset-a-kvm-clone-virtual-machines-with-virt-sysprep-on-linux/. In particular, the command virt-sysprep -d testvm --root-password password:MySuperSecureRootPasswordHere seemed very straight-forward. The command executed until 44% and then exited. Usually that means a problem occurred, but after looking at other sample logs online, I think that was normal.

Unfortunately, I found that my lvm disk UUID was no longer discoverable by GRUB. GG.

Despite being caused by virt-sysprep there was little help when searching including that topic. The virt-sysprep execution log showed that the UUID for the disk was left alone (which it should have been due to the difficulty of figuring out where to update all UUID references inside a guest OS). However, something was still messed up.

So, I had to generalize my search to just grub. Fortunately, http://www.linuxandubuntu.com/home/ways-to-rescue-or-recover-grub-menu had well-written information on how to solve the problem. In short, use ls to find the mount, then type set prefix=<an lvm root>/boot/grub, insmod normal, normal. Then after booting, sudo grub-install /dev/vda.

Re-establishing SSH

Leeroy’s http port was forwarded to the guest OS correctly, but ssh was broken. It’s possible that this occurred weeks ago, and I was just now noticing. However, it seems more likely that virt-sysprep removed the ssh keys, since that’s part of what it does (and indeed, upon further inspection, the .ssh directories are missing from the users’ home directories). virt-sysprep also wiped out the server’s ssh keys. The solution was to generate new host keys, which was doable via the instructions at https://webby.land/2017/10/02/generate-new-host-keys-after-sysprep/. More specifically, dpkg-reconfigure openssh-server. Of course I also had to update my ~/.ssh/known_hosts file on the client, but that was trivial

Reflections

Of course the most obvious reflection is that I should have done a better job keeping the passwords. At the time I didn’t have a good system for storing passwords offline, but I’ve developed such a system since, so this is less likely to occur in the future.

Perhaps more interestingly, I’d thought about making a newer backup of the guest image (in fact, it was on my mental todo list). The problem was that I was out of space on the drive, and I didn’t want to go through the trouble of figuring out where the images were or where I’d mounted the extra hard drive (though df -h clued me in immediately once I realized that was the problem I was facing). I guess I expected virt-sysprep to work. In reflection, I should have known better. The upside is that I got to learn some new things and reinforce some old tricks (plus who’s to say whether guestfish would have worked either), and it took less than a couple hours. At the end of it all, that’s not so bad.

Advertisement

Leeroy Jenkins Jr Available

Leeroy Jenkins Jr. is now mostly working.  It’s a 64-core AMD 3990x Threadripper with 128GB of memory.  It’s currently accessible via http://leeroy2.defthm.com/

There are still some things I’m playing around with, e.g., how many processes make sense to run at once and moving it from under my desk to the closet.  However, it should be generally available.

I’ve moved every build over to Leeroy Jr.  They still exist on Leeroy Sr, but, with the exception of the master build, they’re disabled.

The plan is, at some point, to ship the machine to Kestrel, where it will be hosted.  I’ll be hosting it in my closet until then.  When it’s shipped, I’ll reactivate the builds on Leeroy Sr.  I probably won’t play around with the leeroy vs leeroy2 subdomain until after it’s been installed at Kestrel, so keep the link handy.

I’ll post additional updates (e..g, about downtime) in the ACL2 Community slack channel #continuous-integration-leeroy-jenkins.  If you’d like an invite to the slack server, please let me know.

IMG_20200726_011510.jpg

IMG_20200726_004649

Jenkins vs Teamcity vs Others

A Search for an Alternative to Jenkins

Every couple years, I tend to evaluate the ACL2 community’s build server.  Is it meeting the community’s needs?  Do we need more cores?  Is it secure, or am I likely to receive an email from the cloud provider notifying me that they think the machine is mining cryptocurrency?  This time around, I’ve wondered if we should use a different Continuous Integration (CI) solution.

One hackernews discussion provided some perspective: https://news.ycombinator.com/item?id=19781907.  There are also lots of search results for “alternatives to Jenkins.”  Forrester had a graphic in a report in 2017 that Gitlab cites.  It’s 3 years old at this point, but it at least does a nice job introducing the contenders:

forester-report-2017

Notes about Our Application and Community

Here are some relevant notes on our community.

  • We are an open-source community and would qualify for many CI provider’s “free” tiers.
  • Our ACL2 community development process is relatively light-weight and there is little tolerance for adding complexity to that process.  We have github issues and respond when new-comers create issues, but we hardly use them ourselves.  We would almost never be interested in CI pipelines, as they would add overhead to our currently light-weight contribution process.  Many of us use pull requests, and many of us push directly to one of the testing branches.
  • Re-using artifacts (built books/libraries) across builds saves a lot of build time.  We probably can’t reasonably use any solution that doesn’t easily allow us to re-use previously built artifacts.
  • Our build’s internal dependency graph is insane, and there’s little hope of making any CI software understand it.  My goal is not to cleanup the build — it’s to get/keep something working in a small amount of time.

 

The Status Quo: Jenkins

Jenkins used to be the cat’s meow.  It opened up exploring continuous integration by being accessible and free, and we’ve been using it since 2014.

Since then, I’ve come to learn how painful Jenkins’ security story is.  Maybe we’re willing to trust that the Jekins core is tightly maintained.  Unfortunately, I’m guessing it might be reasonably possible to compromise one of the lesser well-known plugins.  And since lots of plugins that we “need” depend on lots of other plugins, we’re basically stuck absorbing plugins with goofy and obscure names (an indicator, to me, of a potential lack of scrutiny).  Right now we’ve got 30 plugins installed, of which only a handful were explicitly installed by me (the rest being dependencies).  That’s a lot of attack surface!

Even if we assume that there’s no intentional compromise, Jekins and its plugins have lots of accidental security vulnerabilities.  It’s actually really good that Jenkins has a framework for indicating that a plugin, or Jekins itself, has a security vulnerability and needs to be updated.  The problem is that it’s an extra step to automatically update Jenkins and its plugins.  That being said, https://stackoverflow.com/questions/7709993/how-can-i-update-jenkins-plugins-from-the-terminal seems to have a good script for automating that process.

This lack of security is a little more acceptable when Jenkins is only accessible behind a firewall. However, it’s probably less acceptable when the cost of an intrusion via source-code is more expensive (i.e., when money is involved).  This lead me to look for alternatives.

Teamcity (Jetbrains)

I’ve had a deep appreciation for Jetbrains software and the accompanying sales model for a while now.  Their Java, GoLang, and C++ IDEs are really good.  And I appreciate that they’re willing to let the customer stop a product subscription after a year but still give the customer a perpetual license to the last version they paid for.  That’s really respectable.  Furthermore, I’m fed up enough with Jenkins s.t. I’d love to pay someone $100 a year to manage the security story for me.

Here’s what I learned about Teamcity

  • It was started in 2006.  That’s a long time ago.  Some parts seem well-modernized and others seem outdated.
  • It’s kind of clunky to configure — it’s hard to find the right blanks to fill out.  It always takes a while to learn a new interface, but this seemed particularly painful to me.
  • The way some features are implemented and explained ends up making things more confusing.  For example, you can easily monitor “testing-*” branches with a single build.  At first it seems strictly better (and indeed, for many readers, this is a feature not a liability), but supporting this feature adds an extra check-box/blank or two.  Once the blank that implements this feature is understood by the user, it’s not an issue.  However, it slowed me down a bit.  There are some other Teamcity features that were also clunky to learn.
  • Figuring out how to correctly merge changes from a testing branch into a main branch took a significant amount of time.
  • The documentation exists but could use improvement.
  • A lot of the stackoverflow posts on Teamcity topics tend to be from 2012-2013.  This is problematic for a couple reasons: (1) the information provided is outdated and (2) it suggests Teamcity isn’t used as much anymore. (Disclaimer: It could also mean that Teamcity is a perfect product and no one has questions about it anymore, but that seems unlikely.)
  • When I posted a question to the Teamcity support forums, there was no reply.  This suggests that Teamcity is not a high priority for Jetbrains.  Furthermore, the situation this question described was likely indicative of a bug.  Searching for similar posts indicated that other people had encountered similar problems (many years ago).  The fact that I was still hitting that problem is pretty bad.
  • There isn’t a nice presentation for a “matrix” build.  I was able to share the configuration by using “templates”, but boy is it clunky to navigate each configuration.  This is relevant because we run the cross-product of {acl2_vanilla acl2_parallel acl2_real acl2_parallel_real} x {CCL SBCL GCL}.  This means when the user logs in, they’re overwhelmed with ~12 of our least important builds and have to search for the other ~4 “main” builds that are much more important.  Maybe there’s a work-around related to the project hierarchy for that.
  • It doesn’t have a light-weight mechanism for presenting build artifacts (e.g., *.cert.out files) to the user.  I’d need to setup an artifact server (like Artifactory).  I’m guessing open-source artifact servers exist, but I’d rather skip the complexity.  We’re a humble open-source project, not an enterprise.  It’d be kind of neat to have a history of build artifacts (perhaps for easily determining what runes used to be used to prove a theorem), but it’s likely pretty useless for our community’s development process.
  • Evidently there are plug-ins for Teamcity that may help with some of the above problems, but I’ve learned to dislike the concept.
  • Teamcity provides a nice mechanism for emailing notifications upon successful/unsuccessful builds.  It will also restart failed builds automatically (for a specified number of tries) and only email regarding the build failure once.  Unfortunately, there’s no easy way to include the log from a “build step” in that email.  This means I’d need to add my own mail command to the build script, which is work I don’t want to do.  More importantly, it would add code to the project that someone else would then have to maintain if/when I get too busy — i.e., if the smtp server changes, then someone has to go find the code that contains that server name, which is harder than searching through a [well-designed] gui.

Impressions of Other Solutions

Here are my first impressions from other solutions.  They’re exactly that — first impressions.  I don’t want to spend an exhaustive amount of time to completely survey the space — I’m just trying to solve a problem in a reasonable amount of time.

Azure

Looking at the Azure cloud-ops solution, it looks rather simplistic, and I’m just guessing that it’s not MSFT’s main solution anymore.  They should probably be pouring their resources into Github, and they’re probably doing that.  Also, the Azure DevOps page seems to be pretty cloud oriented — e.g., they reference sharepoint on their page.  Further investigation would be needed to actually make a judgement.

TravisCI

The free tier for TravisCI has always appealed to me.  Here are the reasons we end up not using it.

  • There seems to be some sort of magical 120 minute limit to builds in TravisCI.   Since our full build takes well-beyond that, we wouldn’t qualify for the free TraviCI tier.  W.r.t. rolling out TravisCI ourselves, that limit can probably be changed.
  • It also seems that builds in TravisCI don’t re-use artifacts from previous builds.  Re-use of previous artifacts is a highly-desired feature in our situation, so this rules out TravisCI.
  • More investigation would be needed to really rule it out.

Gitlab

I really like gitlab.  Whereas github seemed to be resting on its first-to-market advantage, gitlab has been more innovative and also applied price pressure to github (my personal impression, but seemingly confirmed by Forrester).  That being said, it’s hard for me to imagine a world where it makes sense to use gitlab for the CI and github for actually hosting the repository, pull requests, etc.  Since the community strongly prefers to not move the repository, I’ve ruled out gitlab for now.

Github CI

At this point, I’ve run out of steam.  I’ve put this after Gitlab, but chronologically, I investigated it at the end.  It looks like CI is integrated via “Github Actions”.  Of the options that’s not Jenkins, this is the one I would spend time investigating next.  However, I’m guessing it will have some of the drawbacks mentioned in other solutions, so I’m going to save myself the time for now.

Cloudbees

  • “Built on the most widely used automation server in the world Jenkins™ – CloudBees CI (Core) provides flexible, governed CI/CD you can trust”.  Okay, they’ve got my attention, as that’s pretty much exactly what I want.
  • Upon further reading, it seems like CloudBees CI is maybe more about managing multiple jenkins servers than making Jenkins itself more reliable.
  • I can’t get to a trial of their product without “scheduling a demo.”  Umm, help is typically appreciated, but the fact that I can’t just download and run it is a red flag.  Off the cuff, it seems likely to be too expensive.  Plus, I don’t want to deal with all of the communication overhead.

CircleCI

It looks pretty slick but isn’t a great fit for us.  Here’s the two main things I learned.

  • They let you start with a VM, as opposed to a docker image.  This is convenient for me, since I haven’t dockerized all of the tools necessary to perform the complete build.  However, I have built a VM.
  • “CircleCI automatically runs your pipeline in a clean container or virtual machine, allowing you to test every commit” — this is actually a deal-breaker for us, as we want to re-use build artifacts across builds.

Conclusion

For now, I’ve decided to continue using Jenkins and automate the update process for Jenkins plugins and Jenkins itself.

I spent a lot of time trying to make Teamcity work, but I just can’t justify the turmoil given the lack of convenience.  The others are ruled out for reasons already described.  If/where I’ve misrepresented products, please let me know, and I’ll update the post.  Also, if you’ve read our requirements and know of another product I should consider, please also let me know.

Leeroy updated!

Leeroy has been updated to run on a VM with 8 cores and 52gb of RAM and is finally once again running the latest version of Jenkins.

One would think that moving a disk to a larger instance would be easy.  However, there were several complicating factors, which I describe here for future reference.

First, some script had managed to brick my sudo access to the machine.  Not only had said script inserted the same user 1300 times into one of the sudo auxiliary files, the script had managed to insert nulls into the auxiliary file (perhaps due to a reboot during writing).  This meant I’d get an error everytime I tried to run sudo, blocking any attempt to maintain jenkins or even the machine itself.  Starting a secondary VM; mounting the problematic disk; editing the auxiliary file; and creating a new VM for the new disk fixed that problem.

However creating a new VM sounds simpler than it is.  We use a “preemptible VM” to save costs.  This means the machine goes down automatically at least once every 24 hours.  We use a “managed instance template” to automatically restart this machine.  Thus, I had to rediscover the flags that would create the new template that uses the new disk.  And then I setup the managed instance group that “autoscales” the instance from 1 node up to 1 node (yes, you read that right… it’s a relatively maintainable hack that works 🙂 ).

Jenkins was also updated to the latest version by installing a new version of jenkins.war, all of the plugins were updated, and the old jenkins data has been cleaned out.

A journey with the preprocessor and defprod

How many fields does it take to break defprod?  Hint: you already know the answer.  42!

Why is this?  Suppose you have the following forms:

(include-book "centaur/fty/deftypes" :dir :system)
(fty::defprod slow-it-down
 ((a booleanp) (b booleanp) (c booleanp) (d booleanp) (e booleanp)
 (f booleanp) (g booleanp) (h booleanp) (i booleanp) (j booleanp)
 (k booleanp) (l booleanp) (m booleanp) (n booleanp) (o booleanp)
 (p booleanp) (q booleanp) (r booleanp) (s booleanp) (tt booleanp)
 (u booleanp) (v booleanp) (w booleanp) (xx booleanp) (y booleanp)
 (z booleanp) (aa booleanp) (ab booleanp) (ac booleanp) (ad booleanp)
 (ae booleanp) (af booleanp) (ag booleanp) (ah booleanp) (ai booleanp)
 (aj booleanp) (ak booleanp) (al booleanp) (am booleanp) (an booleanp)
 (ao booleanp) (ap booleanp)
 (aq booleanp) ;; the offending threshold
 ))

As of May 16, 2016, you’ll get a nice error that says the preprocessor’s rewrite stack has reached its limit and quit:

HARD ACL2 ERROR in PREPROCESS: The call depth limit of 1000 has been
exceeded in the ACL2 preprocessor (a sort of rewriter). There is probably
a loop caused by some set of enabled simple rules.

But there are no such looping rules (that I found).  This is evident by the error going away (and an analogous proof completes over a weekend in 130,000 seconds — hah!).

You might think it clever to disable the preprocessor.  That _might_ work, but there are three difficulties with that:

  1. There isn’t any published way to disable the preprocessor at the top-level (there may not be any), and we’re deep inside defprod
  2. Modifying defprod not to call the preprocessor in just the right place could break  other uses of defprod
  3. The subgoal generated when you do disable the preprocessor looks a lot like something a preprocessor could be good at (though, I suspect that in ACL2 we need generalization):
    (IMPLIES
     (AND
      (ALISTP X)
      (CONSP X)
      (CONSP (CDR X))
      (CONSP (CDDR X))
      (CONSP (CDDDR X))
      (CONSP (CDDDDR X))
      (CONSP (CDR (CDDDDR X)))
      (CONSP (CDDR (CDDDDR X)))
      (CONSP (CDDDR (CDDDDR X)))
      (CONSP (CDDDDR (CDDDDR X)))
      ...)
     (EQUAL (LIST (CONS 'B (<booleanp-fix> (CDR (CAR X))))
                  (CONS 'C (<booleanp-fix> (CDR (CADR X)))))
            X))

With what solution did I end up?

Change the format of the defprod from the default of being represented as an alist to being represented as a tree.

It’s still slow, and I’ll probably clean it up some more, but now it finishes in 666.04 seconds instead of 130,000.

Migrated Leeroy from home to Google Compute Engine

Leeroy (our community Jenkins server) is more or less migrated to Google Compute Engine.  When it’s up, it can be viewed at the same old URL, leeroy.defthm.com.

DSC04820 - cropped

Leeroy has been running in GCE for awhile now, but it took a bit of time to set it up the way I wanted.  Specifically, I wanted to make it a preemptible instance that was automatically restarted once resources became available again.  This costs about a quarter of what it would cost to have the instance up all the time.  Since our community is a humble community, it’s probably fine for the server to have some downtime.  Having it up most of the time will make committing to the “testing” branch more meaningful, as Leeroy automatically merges in changes that pass “testing” into “master.”

This frees up my home machine(s) but leaves me wondering… what am I going to do with all of this RAM @ home?

Leeroy is back!

The continuous integration server is back online.  Phew!  Turns out that either new subnets get assigned the “public” windows 7 firewall setting by default, or I selected “public” at some point without realizing it.

While updating my DHCP leases, I was really pleased to see that Synology web interface automatically redirected me to its new IP address, without requiring me to type it in.  Way to think of a border case and address it elegantly Synology!

http://leeroy.defthm.com

Also, I’ve decided to make leeroy’s read-only access public.

Leeroy under construction

I picked up a new Buffalo router that runs DD-WRT.  It seems there’s a bug with port forwarding in this router (although I get an IP assigned in a different IP block than with the old router), so leeroy should be considered under construction until I sort that out.

Leeroy

A continuous integration server for the ACL2 community is now hosted at http://leeroy.defthm.com.  To make my life easier (and not have to worry as much about security), you need an account to read the build status.  But, I’m happy to provide such an account upon request!

Beyond just informing us as to the status of the build, another nice perk of having a CI server is that the “acl2” build target functions as a gatekeeper for automatically pushing commits that are still able to build the manual from a “beta” branch to a “master” branch.  This lowers the amount of paranoia that I must maintain when committing changes.

.leeroy

The service is currently sponsored exclusively by Defthm Consulting, but we’re looking for financial sponsors who can help us scale up from 3 cores and 16GB of RAM to using hardware available via cloud services like Google Compute Engine and Amazon Web Services.  This would allow the community to determine the status of various ACL2 builds based upon building more than just arithmetic-2 (the current “acl2-multi” build target).  The estimated cost for this better service is somewhere around $5000 per annum.

A screenshot of the “acl2-multi” build status:

leeroy-acl2-multi